Provide an overview of the project goals and motivation
The motivation for a project exploring the correlation between low birth weight, premature birth rate, cancer mortality rate, cardiovascular disease hospitalization rate, asthma hospitalization rate and PM2.5 concentration, as well as socioeconomic, racial, education, and gender factors across the 62 counties of New York State, stems from a compelling public health imperative.
All the outcomes of interest are critical indicators of population health. By examining the impact of fine particulate pollution, this project aims to uncover environmental health disparities that disproportionately affect under-served communities.
It seeks to inform public health policy by highlighting the need for targeted interventions that could mitigate the adverse effects of air pollution on vulnerable populations, including pregnant women and newborns. The analysis of socioeconomic and racial factors will offer insights into the complex interplay between environment and social determinants of health, potentially guiding urban planning and healthcare resource allocation.
Furthermore, by considering gender-specific vulnerabilities,the project aims to provide a comprehensive overview of the risk landscape, fostering community awareness and prompting action to improve air quality and health equity in one of the most densely populated urban areas in the world.
What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis?
Source, scraping method, cleaning, etc.
Click show on the right to view the code chunk for data
importing.
lowbirthweight <- read_csv("csv_NYC_lowbirthweight.csv")
pm2_5 <- read_csv("pm2.5.csv")
edu_NY <- read_excel("Edu_NY.xlsx")
race_NY <- read_excel("Race_NY.xlsx")
HHincome_NY <- read_excel("HHincome_NY.xlsx")
Age_NY <- read_excel("Age_NY.xlsx")
Sex_NY <- read_excel("Sex_NY.xlsx")
health <- read_excel("chir_current_data.xlsx")
uscounties <- read_csv("uscounties.csv") #Simplemaps.com
pm2_5 <- pm2_5 %>%
janitor::clean_names()%>%
select (county,value) %>%
rename (annual_pm2.5 = "value")
The dataset is obtained from EPH Tracking Website from CDC (https://ephtracking.cdc.gov/DataExplorer/). This has 62
rows and 2 columns of data. In which, 2 variables are:
- county: NY county name
- annual_pm2.5: annual estimated PM2.5 concentration at
each county (ug/m^3)
The following set of demographic datasets is obtained from Census Reporter webpage (https://censusreporter.org/).
edu <- edu_NY %>%
janitor::clean_names()%>%
mutate (percentage_high_education = (bach_male+master_male+prof_male+doct_male+bach_female+master_female+prof_female+doct_female)/total) %>%
filter(str_detect (name, " County, NY")) %>%
mutate(county = str_replace (name, " County, NY", "")) %>%
select(county,percentage_high_education) %>%
mutate(county = str_replace (county, "St.", "St")) %>%
mutate(county = str_replace (county, "Stuben", "Steuben"))%>%
select(county,percentage_high_education)
This has 62 rows and 2 columns of data. In which, 2 variables
are:
- county: NY county name
- percentage_high_education: percentage of population who
finished a higher education level (higher than a bachelor degree)
race_non_hisp_white <- race_NY %>%
janitor::clean_names()%>%
filter( x1 == "White Non-Hispanic") %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "percent_non_hisp_white") %>%
select (county,percent_non_hisp_white) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,percent_non_hisp_white ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
race_non_hisp_black <- race_NY %>%
janitor::clean_names()%>%
filter( x1 == "Black Non-Hispanic") %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "percent_non_hisp_black") %>%
select (county,percent_non_hisp_black) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,percent_non_hisp_black ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
race_hisp_white <- race_NY %>%
janitor::clean_names()%>%
filter( x1 == "White Hispanic") %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "percent_hisp_white") %>%
select (county,percent_hisp_white) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,percent_hisp_white ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
race_hisp_black <- race_NY %>%
janitor::clean_names()%>%
filter( x1 == "Black Hispanic") %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "percent_hisp_black") %>%
select (county,percent_hisp_black) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,percent_hisp_black ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
race_merge <- race_non_hisp_white %>%
inner_join(race_non_hisp_black, by = "county") %>%
inner_join(race_hisp_white, by = "county") %>%
inner_join(race_hisp_black, by = "county")
race_merge <- race_merge %>%
mutate (percent_other = 1 - percent_non_hisp_white - percent_non_hisp_black - percent_hisp_white - percent_hisp_black)
This has 62 rows and 6 columns of data. In which, 6 variables
are:
- county: NY county name
- percent_non_hisp_white: percentage of population who are
identified as Non-Hispanic White
- percent_non_hisp_black: percentage of population who are
identified as Non-Hispanic Black
- percent_hisp_white: percentage of population who are
identified as Hispanic White
- percent_hisp_black: percentage of population who are
identified as Hispanic Black
- percent_other: percentage of population who are
identified as any other ethinic groups
income <- HHincome_NY %>%
janitor::clean_names() %>%
filter (x1 == "percent_high_income") %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "percent_high_income") %>%
select (county,percent_high_income) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,percent_high_income ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
This has 62 rows and 2 columns of data. In which, 2 variables
are:
- county: NY county name
- percent_high_income: percentage of population who are at
higher income households (>$75,000 annually)
age <- Age_NY %>%
janitor::clean_names() %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "median_age") %>%
select (county,median_age) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,median_age ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
This has 62 rows and 2 columns of data. In which, 2 variables
are:
- county: NY county name
- median_age: median age of the population at each
county
sex <- Sex_NY %>%
janitor::clean_names() %>%
filter (x1 == "Male:") %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "percent_male") %>%
select (county,percent_male) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,percent_male ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
This has 62 rows and 2 columns of data. In which, 2 variables
are:
- county: NY county name
- percent_male: percentage of population who are identified
as male
lowbirthweight <- lowbirthweight %>%
janitor::clean_names()%>%
select (region_county, percentage)%>%
rename (county = "region_county", percent_lowbirthweight = "percentage")
This has 87 rows and 2 columns of data. In which, 2 variables
are:
- county: NY county name
- percent_lowbirthweight: percentage of children being born
being identified as low birth weight (<2,500g) (https://www.health.ny.gov/)
The following dataset is obtained from New York State Department of Health (https://www.health.ny.gov/) that contains 4 different health indicators that will complement with the low birthweight. These 5 will act as our outcomes of interest for our regression models.
health <- health %>%
janitor::clean_names()%>%
select (geographic_area,indicator_title,topic_area,rate_percent,measurement) %>%
filter( str_detect (geographic_area, " County")) %>%
mutate (county = str_replace (geographic_area, " County", "")) %>%
select (county, everything()) %>%
select (-geographic_area) %>%
filter(topic_area == "Cancer Indicators" | topic_area == "Respiratory Disease Indicators" | topic_area == "Cardiovascular Disease Indicators" | topic_area == "Maternal and Infant Health Indicators")
cancer <- health %>%
filter (topic_area == "Cancer Indicators") %>%
filter (indicator_title == "All cancer incidence rate per 100,000") %>%
select (-c(indicator_title, topic_area, measurement)) %>%
mutate(rate_percent = as.numeric(rate_percent)/10) %>%
rename (cancer_mortality_per_10k = "rate_percent")
resp <- health %>%
filter (topic_area == "Respiratory Disease Indicators") %>%
filter (indicator_title == "Asthma hospitalization rate per 10,000") %>%
select (-c(indicator_title, topic_area, measurement)) %>%
rename (asthma_hosp_rate_per_10k = "rate_percent")
cardio <- health %>%
filter (topic_area == "Cardiovascular Disease Indicators") %>%
filter (indicator_title == "Cardiovascular disease hospitalization rate per 10,000") %>%
select (-c(indicator_title, topic_area, measurement)) %>%
rename (cardio_hosp_rate_per_10k = "rate_percent")
maternal <- health %>%
filter (topic_area == "Maternal and Infant Health Indicators") %>%
filter (indicator_title == "Percentage of premature births with <37 weeks gestation") %>%
select (-c(indicator_title, topic_area, measurement)) %>%
rename (premature_percentage = "rate_percent")
They are:
- cancer_mortality_per_10k: percentage of cancer mortality
per 100 thousands people in each NY county (Cancer Indicator)
- asthma_hosp_rate_per_10k: percentage of asthma
hospitalization per 10 thousands people in each NY county (Respiratory
Disease Indicator)
- cardio_hosp_rate_per_10k“: percentage of
cardiovascular-disease-related hospitalization per 10 thousands people
in each NY county (Cardiovascular Disease Indicator)
- premature_percentage: percentage of children being born
prematurely (<37 gestational weeks) in each NY county
Here we perform inner_join() to create 2 bigger datasets
health_merge and demographic_merge. Then, we
join them with our lowbirthweight & pm2_5
tp make a finalized data frame called merge. And, we will
use this for regression model.
health_merge <- maternal %>%
inner_join(cancer, by = "county") %>%
inner_join(resp, by = "county") %>%
inner_join(cardio, by = "county")
demographic_merge <- age %>%
inner_join(sex, by = "county") %>%
inner_join(income, by = "county") %>%
inner_join(race_merge, by = "county") %>%
inner_join(edu, by = "county") %>%
mutate(county = str_replace (county, "St ", "St. "))
merge <- lowbirthweight %>%
inner_join(pm2_5, by = "county") %>%
inner_join(health_merge, by = "county") %>%
inner_join(demographic_merge, by = "county")
merge <- merge %>%
select (county, annual_pm2.5,everything())%>%
mutate(asthma_hosp_rate_per_10k = as.numeric(asthma_hosp_rate_per_10k))%>%
mutate(cardio_hosp_rate_per_10k = as.numeric(cardio_hosp_rate_per_10k)) %>%
mutate(premature_percentage = as.numeric(premature_percentage))%>%
mutate(cancer_mortality_per_10k = as.numeric(cancer_mortality_per_10k))
uscounties <- uscounties %>%
filter (state_id == "NY") %>%
select (county, lat, lng)
map <- merge %>%
inner_join(uscounties, by ="county")
write.csv(map, "NY_map.csv", row.names = FALSE)
This file is specifically made for the purpose of making map (in Shiny App).
Visualizations, summaries, and exploratory statistical analyses. Justify the steps you took, and show any major changes to your ideas.
merge_cont =
merge %>%
select(-county)
skim_table <- skim(merge_cont)
numeric_summary <- skim_table %>%
select(Variables = skim_variable, Nmiss = n_missing, Mean = numeric.mean, Median = numeric.p50, STD = numeric.sd)
kable(numeric_summary, "html", digits = 1) %>%
kable_styling(full_width = FALSE, position = "center", font_size = 15) %>%
column_spec(1, width = "3cm") %>%
column_spec(2, width = "3cm") %>%
column_spec(3, width = "3cm") %>%
column_spec(4, width = "3cm") %>%
column_spec(5, width = "2cm")
| Variables | Nmiss | Mean | Median | STD |
|---|---|---|---|---|
| annual_pm2.5 | 0 | 7.0 | 6.8 | 1.1 |
| percent_lowbirthweight | 0 | 7.2 | 7.3 | 1.5 |
| premature_percentage | 1 | 8.8 | 8.9 | 1.1 |
| cancer_mortality_per_10k | 1 | 68.0 | 68.9 | 8.0 |
| asthma_hosp_rate_per_10k | 3 | 4.4 | 3.7 | 3.8 |
| cardio_hosp_rate_per_10k | 0 | 151.1 | 154.9 | 25.9 |
| median_age | 0 | 41.9 | 42.1 | 3.9 |
| percent_male | 0 | 0.5 | 0.5 | 0.0 |
| percent_high_income | 0 | 0.4 | 0.4 | 0.1 |
| percent_non_hisp_white | 0 | 0.8 | 0.9 | 0.2 |
| percent_non_hisp_black | 0 | 0.1 | 0.0 | 0.1 |
| percent_hisp_white | 0 | 0.0 | 0.0 | 0.0 |
| percent_hisp_black | 0 | 0.0 | 0.0 | 0.0 |
| percent_other | 0 | 0.1 | 0.1 | 0.1 |
| percentage_high_education | 0 | 0.3 | 0.3 | 0.1 |
#numeric_summary_df <- as.data.frame(numeric_summary)
#summary_plot =
# plot_ly(numeric_summary_df, x = ~Mean, y = ~Variables, type = 'box', text = ~Variables, hoverinfo = 'mean', mean = "Mean") %>%
# layout(title = 'Mean',
# xaxis = list(title = 'Mean'),
# yaxis = list(title = 'Variables'))
#summary_plot
merge %>%
plot_ly(
x = ~reorder(county, percent_high_income),
y = ~percent_high_income,
type = "bar",
marker = list(color = "red1")
) %>%
layout(
title = "Percentage of High Income",
xaxis = list(title = "County Name", categoryorder = "total descending"),
yaxis = list(title = "Percentage"),
barmode = "stack"
)
race_plot <- merge %>%
select (county, percent_non_hisp_white, percent_non_hisp_black, percent_hisp_white, percent_hisp_black, percent_other) %>%
pivot_longer(
cols = starts_with("percent_"), names_to = "race", values_to = "percentage")
race_plot%>%
plot_ly(x = ~county, y = ~percentage, type = "bar",color = ~race,colors = "RdYlGn", hoverinfo = "y+name") %>%
layout(barmode = "stack",
title = "Racial Composition in NY county",
xaxis = list(title = "County"),
yaxis = list(title = "Percentage (%)"))
LBR_graph <- merge %>%
drop_na() %>%
plot_ly(
x = ~county,
y = ~percent_lowbirthweight,
color = ~county,
type = "scatter",
mode = "markers"
) %>%
layout(
title = "Rate of Low Birthweight",
xaxis = list(title = "County", tickangle = 90),
yaxis = list(title = "Percentage of Low Birthweight"),
showlegend = FALSE
)
LBR_graph
premature_graph <- merge %>%
drop_na() %>%
plot_ly(
x = ~county,
y = ~premature_percentage,
color = ~county,
type = "scatter",
mode = "markers"
) %>%
layout(
title = "Rate of Premature Birth",
xaxis = list(title = "County", tickangle = 90),
yaxis = list(title = "Rate of Premature Birth"),
showlegend = FALSE
)
premature_graph
asthma_graph <- merge %>%
drop_na() %>%
plot_ly(
x = ~county,
y = ~asthma_hosp_rate_per_10k,
color = ~county,
type = "scatter",
mode = "markers"
) %>%
layout(
title = "Rate of Asthma Hospitalization",
xaxis = list(title = "County", tickangle = 90),
yaxis = list(title = "Rate per 10k"),
showlegend = FALSE
)
asthma_graph
cancer_graph <- merge %>%
drop_na() %>%
plot_ly(
x = ~county,
y = ~cancer_mortality_per_10k,
color = ~county,
type = "scatter",
mode = "markers"
) %>%
layout(
title = "Rate of Cancer Mortality",
xaxis = list(title = "County", tickangle = 90),
yaxis = list(title = "Rate per 100k"),
showlegend = FALSE
)
cancer_graph
cardio_graph <- merge %>%
drop_na() %>%
plot_ly(
x = ~county,
y = ~cardio_hosp_rate_per_10k,
color = ~county,
type = "scatter",
mode = "markers"
) %>%
layout(
title = "Rate of Cardiovascular Disease Hospitalization",
xaxis = list(title = "County", tickangle = 90),
yaxis = list(title = "Rate per 10k"),
showlegend = FALSE
)
cardio_graph
fig <- plot_ly()
fig <- fig %>% add_trace(x = ~merge$county, y = ~merge$percent_lowbirthweight, type = "scatter", mode = "markers", name = "Low Birthweight Rate")
fig <- fig %>% add_trace(x = ~merge$county, y = ~merge$premature_percentage, type = "scatter", mode = "markers", name = "Premature Birth Rate")
fig <- fig %>% add_trace(x = ~merge$county, y = ~merge$asthma_hosp_rate_per_10k, type = "scatter", mode = "markers", name = "Asthma Hospitalization Rate")
fig <- fig %>% add_trace(x = ~merge$county, y = ~merge$cancer_mortality_per_10k, type = "scatter", mode = "markers", name = "Cancer Mortality Rate")
fig <- fig %>% add_trace(x = ~merge$county, y = ~merge$cardio_hosp_rate_per_10k, type = "scatter", mode = "markers", name = "Cardiovascular Disease Hospitalization Rate")
fig <- fig %>% layout(
title = "Overlay of 5 Outcomes",
xaxis = list(title = "County", tickangle = 90),
yaxis = list(title = "Rate"),
showlegend = TRUE
)
fig
If you undertake formal statistical analyses, describe these in detail
After cleaning the data, we then look at the effect that the annual concentration of pm 2.5 separately has on multiple outcomes among 62 counties in New York, including low birth weight rate, premature birth rate, cancer mortality rate, asthma-related hospitalization rate, and cardiovascular-disease-related hospitalization rate.
Since all of the outcomes are measured as rates (i.e., continuous), we use linear regression models to assess the effect of annual concentration of pm 2.5 on the outcomes of interests, adjusting for age, sex, ethnicity, education, income. In our models:
Age is reported as a continuous variable that reflects the median age of each county.
Sex is reported as a continuous variable that reflects the percentage of male in each county. The variable that reflects the percentage of female in each county is not included in our model to avoid perfect multicollinearity.
Ethnicity is reported as a continuous variable that reflects the percentage of different ethnicity groups in each county, including Hispanic-Black, Hispanic-White, non-Hispanic-Black, non-Hispanic-White, and Others (if Asian, Native American, or belong to two or more race). The variable that reflects the percentage of people who belong to the category Others is not included in our model to avoid perfect multicollinearity.
Education is reported as a continuous variable that reflects the percentage of people who have obtained one or more higher education degrees in each county.
Income is reported as a continuous variable that reflects the percentage of household of each county that have annual income exceeding $75,000.
For each of the five outcomes of interest, we first start with the full model:
Outcome = Annual pm 2.5 concentration + median age + percentage of male + percentage of Hispanic-Black + percentage of Hispanic-White + percentage of non-Hispanic-Black + percentage of non-Hispanic-White + percentage of household income exceeding 75,000 USD + percentage of people obtained higher education
We then perform bidirectional stepwise selection on all five models and output the best models. The full model and the resulting best models after stepwise selection for each outcome are detailed below:
lm_pm2.5_birthweight_adjusted <-lm(percent_lowbirthweight ~ annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
lm_pm2.5_birthweight_adjusted_best <- step(lm_pm2.5_birthweight_adjusted, direction = 'both', trace=FALSE)
summary(lm_pm2.5_birthweight_adjusted_best)%>%
tab_model()
| percent lowbirthweight | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 11.53 | 7.18 – 15.88 | <0.001 |
| median age | -0.11 | -0.21 – -0.02 | 0.024 |
| percent non hisp black | 8.05 | 1.34 – 14.76 | 0.019 |
| Observations | 62 | ||
| R2 / R2 adjusted | 0.274 / 0.249 | ||
The best model that was outputted by the stepwise regression is
low birth weight = 11.53 - 0.11(median age) + 8.05 (percent_non_hisp_black)
On average, for a county whose median age is 0 and has 0% of non-Hispanic-Black, the expected low birth weight rate is 11.5% (no practical meaning). Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the intercept is significantly different than 0.
On average, for every unit increase of median age (1 year), the expected low birth weight rate of the county decreases by 0.11%. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.
On average, for every unit increase in the percentage of non-Hispanic-Black, the expected low birth weight rate of the county increases by 8.05%. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for percentage of non-Hispanic-Black is significantly different than 0.
The adjusted R-squared of 0.249 implies that 24.9% of the variation in the response variable can be explained by its linear relationship with the set of the 2 predictors (median age and percentage of non-Hispanic-Black).
lm_pm2.5_premature_adjusted <-lm(premature_percentage ~ annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
lm_pm2.5_premature_adjusted_best <- step(lm_pm2.5_premature_adjusted, direction = 'both', trace=FALSE)
summary(lm_pm2.5_premature_adjusted_best)%>%
tab_model()
| premature percentage | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 4.88 | 0.96 – 8.80 | 0.016 |
| median age | 0.08 | -0.01 – 0.17 | 0.067 |
| percent non hisp black | 8.03 | 2.63 – 13.42 | 0.004 |
| Observations | 61 | ||
| R2 / R2 adjusted | 0.135 / 0.105 | ||
The best model that was outputted by the stepwise regression is
premature birth weight = 4.88 + 0.08(median age) + 8.03 (percent_non_hisp_black)
On average, for a county whose median age is 0 and has 0% of non-Hispanic-Black, the expected low birth weight rate is 4.88% (no practical meaning). Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the intercept is significantly different than 0.
On average, for every unit increase of median age (1 year), the expected low birth weight rate of the county decreases by 0.11%. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.
On average, for every unit increase in the percentage of non-Hispanic-Black, the expected low birth weight rate of the county increases by 8.03%. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for percentage of non-Hispanic-Black is significantly different than 0.
The adjusted R-squared of 0.105 implies that 10.5% of the variation in the response variable can be explained by its linear relationship with the set of the 2 predictors (median age and percentage of non-Hispanic-Black).
lm_pm2.5_cancer_adjusted <-lm(cancer_mortality_per_10k ~annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
lm_pm2.5_cancer_adjusted_best<-step(lm_pm2.5_cancer_adjusted, direction = 'both', trace=FALSE)
summary (lm_pm2.5_cancer_adjusted_best)%>%
tab_model()
| cancer mortality per 10 k | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | -0.18 | -51.73 – 51.36 | 0.994 |
| annual pm2 5 | 1.33 | 0.10 – 2.55 | 0.035 |
| median age | 1.31 | 0.98 – 1.65 | <0.001 |
| percent high income | -21.35 | -37.32 – -5.38 | 0.010 |
| percent non hisp white | 46.40 | 20.09 – 72.71 | 0.001 |
| percent non hisp black | 53.63 | 2.27 – 104.99 | 0.041 |
| percent hisp white | 102.38 | 10.35 – 194.41 | 0.030 |
| percent hisp black | -178.74 | -417.28 – 59.79 | 0.139 |
| percent male | -57.41 | -140.92 – 26.09 | 0.174 |
| Observations | 61 | ||
| R2 / R2 adjusted | 0.809 / 0.780 | ||
The best model that was outputted by the stepwise regression is
cancer mortality rate = -1.85 + 13.27(annual pm 2.5) + 13.14(median age) - 213.49(percent high income) + 464.03(percent non hisp white) + 536.28(percent non hisp black) + 1023.82(percent hisp white) - 1787.43(percent hisp black) - 574.15(percent male)
On average, for a county whose annual pm 2.5 concentration is 0, median age is 0, has 0% of high income, has 0% of non-Hispanic-Black, has 0% of non-Hispanic-White, has 0% of Hispanic-Black, has 0% of Hispanic-White, has 0% of male, the expected low birth weight rate is -1.85 case per 100,000 people (no practical meaning). However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the intercept is significantly different than 0.
On average, for every unit increase of annual pm 2.5 concentration, the expected cancer mortality rate of the county increases by 13.27 deaths per 100,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for annual pm 2.5 concentration is significantly different than 0.
On average, for every unit increase of median age (1 year), the expected cancer mortality rate of the county decreases by 13.14 cases per 100,000 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.
On average, for every unit increase in the percentage of high income, the expected cancer mortality rate of the county decreases by 213.49 cases per 100,000 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for percentage of high income is significantly different than 0.
On average, for every unit increase of percentage of non-Hispanic-White, the expected cancer mortality rate of the county increases by 464.03 deaths per 100,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for percentage of non-Hispanic-White is significantly different than 0.
On average, for every unit increase of percentage of non-Hispanic-Black, the expected cancer mortality rate of the county increases by 536.28 deaths per 100,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for percentage of non-Hispanic-Black is significantly different than 0.
On average, for every unit increase of percentage Hispanic-White, the expected cancer mortality rate of the county increases by 1023.08 deaths per 100,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for percentage of Hispanic-White is significantly different than 0.
On average, for every unit increase of percentage Hispanic-Black, the expected cancer mortality rate of the county decreases by 1787.43 deaths per 100,00 people. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient percentage of Hispanic-Black is significantly different than 0.
On average, for every unit increase of percentage of male, the expected cancer mortality rate of the county decreases by 574.15 deaths per 100,00 people. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for the percentage of male is significantly different than 0.
The adjusted R-squared of 0.780 implies that 70.8% of the variation in the response variable can be explained by its linear relationship with the set of the 8 predictors (annual pm 2.5 conc., median age, percentage of high income, percentage of 4 ethnicity groups, percentage of male).
lm_pm2.5_asthma_adjusted <-lm(asthma_hosp_rate_per_10k ~ annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
lm_pm2.5_asthma_adjusted_best <- step(lm_pm2.5_asthma_adjusted, direction = 'both', trace=FALSE)
summary(lm_pm2.5_asthma_adjusted_best)%>%
tab_model()
| asthma hosp rate per 10 k | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 1.36 | 0.06 – 2.67 | 0.041 |
| percent high income | 4.65 | 0.32 – 8.99 | 0.036 |
| percent non hisp black | 13.52 | 5.77 – 21.26 | 0.001 |
| percent hisp black | 366.67 | 314.84 – 418.50 | <0.001 |
| percentage high education | -5.31 | -9.55 – -1.06 | 0.015 |
| Observations | 59 | ||
| R2 / R2 adjusted | 0.935 / 0.930 | ||
The best model that was outputted by the stepwise regression is
asthma hospitalization rate = 1.36 + 4.65(percent high income) + 13.52(percent non hisp black) + 366.67 (percent hisp black) - 5.31(percent high education)
On average, for a county who has zero percent of high income, has zero percent of non-Hispanic-Black, has zero percent of Hispanic-Black, has zero percent of high education, the expected asthma hospitalization rate is 1.85 case per 10,000 people (no practical meaning). Since the p-value is smaller han 0.05, we have sufficient evidence to claim that the intercept is significantly different than 0.
On average, for every unit increase in the percentage of high income, the expected asthma hospitalization rate of the county increases by 4.65 cases per 10,000 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for the percentage of high incomeis significantly different than 0.
On average, for every unit increase of percentage of non-Hispanic-Black, the expected cancer mortality rate of the county increases by 13.52 deaths per 10,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for the percentage of non-Hispanic-Black is significantly different than 0.
On average, for every unit increase of percentage of Hispanic-Black, the expected cancer mortality rate of the county increases by 366.67 deaths per 10,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for the percentage of Hispanic-Black is significantly different than 0.
On average, for every unit increase of percentage of high education, the expected cancer mortality rate of the county decreases by 5.31 deaths per 10,00 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for the percentage of higher education is significantly different than 0.
The adjusted R-squared of 0.930 implies that 93.0% of the variation in the response variable can be explained by its linear relationship with the set of the 4 predictors (percentage of high income, percentage of non-Hispanic-Black, percentage of Hispanic-Black, percentage of higher education).
lm_pm2.5_cardio_adjusted <-lm(cardio_hosp_rate_per_10k ~annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
lm_pm2.5_cardio_adjusted_best<-step(lm_pm2.5_cardio_adjusted, direction = 'both', trace=FALSE)
summary (lm_pm2.5_cardio_adjusted_best)%>%
tab_model()
| cardio hosp rate per 10 k | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 157.52 | -155.87 – 470.91 | 0.318 |
| annual pm2 5 | 6.97 | -0.14 – 14.08 | 0.055 |
| median age | 1.73 | 0.01 – 3.44 | 0.049 |
| percent non hisp white | 117.26 | -30.46 – 264.97 | 0.117 |
| percent non hisp black | 222.99 | -83.73 – 529.70 | 0.151 |
| percent hisp white | 439.56 | -49.52 – 928.63 | 0.077 |
| percentage high education | -148.61 | -224.74 – -72.49 | <0.001 |
| percent male | -405.53 | -920.48 – 109.43 | 0.120 |
| Observations | 62 | ||
| R2 / R2 adjusted | 0.329 / 0.242 | ||
The best model that was outputted by the stepwise regression is
cardiovascular disease hospitalization rate = 157.52 + 6.97(annual pm 2.5) + 1.73(median age) + 117.26 (percent non hisp white) + 222.99(percent non hisp black) + 439.56(percent hisp white) - 148.61.43(percent high education) - 405.53(percent male)
On average, for a county whose annual pm 2.5 concentration is 0, median age is 0, has zero percent of non-Hispanic-White, has zero percent of non-Hispanic-Black, has zero percent of Hispanic-White, has zero percent of high education, has zero percent of male, the expected cardiovascular disease hospitalization rate is 157.52 case per 10,000 people (no practical meaning). However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the intercept is significantly different than 0.
On average, for every unit increase of annual pm 2.5 concentration, the expected cardiovascular disease hospitalization rate of the county increases by 6.97cases per 10,000 people. However, since the p-value is smaller than 0.05, we do not have sufficient evidence to claim that the beta coefficient for annual pm 2.5 concentration is significantly different than 0.
On average, for every unit increase of median age (1 year), the expected cardiovascular disease hospitalization rate of the county decreases by 1.73 cases per 10,000 people. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.
On average, for every unit increase of percent of non-Hispanic-White, the expected cardiovascular disease hospitalization rate of the county increases by 177.26 cases per 10,000 people. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for the percent of non-Hispanic-White is significantly different than 0.
On average, for every unit increase of percent of non-Hispanic-Black, the expected cardiovascular disease hospitalization rate of the county increases by 222.99 cases per 10,000 people. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for percent of non-Hispanic-Black is significantly different than 0.
On average, for every unit increase of percent of Hispanic-White, the expected cardiovascular disease hospitalization rate of the county increases by 439.56 cases per 10,000 people. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for percent of Hispanic-White is significantly different than 0.
On average, for every unit increase of percent of higher education, the expected cardiovascular disease hospitalization rate of the county decreases by 148.61 cases per 10,000 people. Since the p-value is larger than 0.05, we have sufficient evidence to claim that the beta coefficient for percent of higher education is significantly different than 0.
On average, for every unit increase of percentage of male, the expected cardiovascular disease hospitalization rate of the county decreases by 405.53 cases per 10,000 people. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for percentage male is significantly different than 0.
The adjusted R-squared of 0.242 implies that 24.2% of the variation in the response variable can be explained by its linear relationship with the set of the 7 predictors (annual pm 2.5 concentration, median age, percent non-Hispanic-White, percent non-Hispanic-Black, percent Hispanic-White, percentage of higher education, percentage of male).
What were your findings? Are they what you expect? What insights into the data can you make?